Abstract

This article outlines a new method of locating discourse
boundaries based on lexical cohesion and a graphical
technique called dotplotting. The application of dotplotting
to discourse segmentation can be performed either
manually, by examining a graph, or automatically,
using an optimization algorithm. The results of two experiments
involving automatically locating boundaries
between a series of concatenated documents are presented.
Areas of application and future directions for
this work are also outlined.
Introduction

In general, texts are "about" some topic. That is, the 
sentences which compose a document contribute infor- 
mation related to the topic in a coherent fashion. In all 
but the shortest texts, the topic will be expounded upon 
through the discussion of multiple subtopics. Whether 
the organization of the text is hierarchical in nature, 
as described in (Grosz and Sidner, 1986), or linear, as 
examined in (Skorochod'ko, 1972), boundaries between 
subtopics will generally exist. 

In some cases, these boundaries will be explicit and 
will correspond to paragraphs, or in longer texts, sec- 
tions or chapters. They can also be implicit. Newspa- 
per articles often contain paragraph demarcations, but 
less frequently contain section markings, even though 
lengthy articles often address the main topic by dis- 
cussing subtopics in separate paragraphs or regions of 
the article. 

Topic boundaries are useful for several different tasks. 
Hearst and Plaunt (1993) demonstrated their usefulness 
for information retrieval by showing that segmenting 
documents and indexing the resulting subdocuments 
improves accuracy on an information retrieval task. 
Youmans (1991) showed that his text segmentation algorithm
could be used to manually find scene boundaries
in works of literature. Morris and Hirst (1991) attempted
to confirm the theories of discourse structure
outlined in (Grosz and Sidner, 1986) using information
from a thesaurus. In addition, Kozima (1993) speculated
that segmenting text along topic boundaries may
be useful for anaphora resolution and text summarization.

This paper presents an automatic method of finding
discourse boundaries based on the repetition of lexi- 
cal items. Halliday and Hasan (1976) and others have 
claimed that the repetition of lexical items, and in par- 
ticular content-carrying lexical items, provides coher- 
ence to a text. This observation has been used implic- 
itly in several of the techniques described above, but 
the method presented here depends exclusively on it. 

Methodology 

Church (1993) describes a graphical method, called dot- 
plotting, for aligning bilingual corpora. This method 
has been adapted here for finding discourse boundaries. 
The dotplot used for discovering topic boundaries is cre- 
ated by enumerating the lexical items in an article and 
plotting points which correspond to word repetitions. 
For example, if a particular word appears at word positions
x and y in a text, then the four points corresponding
to the Cartesian product of the set containing
these two positions with itself would be plotted. That
is, (x,x), (x,y), (y,x) and (y,y) would be plotted on
the dotplot.
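
The construction just described can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments; the function name dotplot_points and the representation of the text as a list of already-filtered tokens are assumptions made for the example.

```python
from collections import defaultdict
from itertools import product


def dotplot_points(tokens):
    """Return the set of (x, y) points plotted for a list of tokens.

    For every word, all pairs of positions at which it occurs are
    plotted, i.e. the Cartesian product of its position set with itself.
    """
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        positions[token].append(index)

    points = set()
    for occurrences in positions.values():
        points.update(product(occurrences, repeat=2))
    return points


points = dotplot_points(["stock", "market", "stock"])
```

Because each word trivially matches itself, every position contributes a point on the main diagonal, and the set of points is symmetric about that diagonal.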

Prior to creating the dotplot, several filters are ap- 
plied to the text. First, since closed-class words carry 
little semantic weight, they are removed by filtering 
based on part of speech information. Next, the remaining
words are lemmatized using the morphological analysis
software described in (Karp et al., 1992). Finally,
the lemmas are filtered to remove a small number of 
common words which are regarded as open-class by the 
part of speech tag set, but which contribute little to the 
meaning of the text. For example, forms of the verbs 
BE and HAVE are open class words, but are ubiquitous 
in all types of text. Once these steps have been taken, 
the dotplot is created in the manner described above. A 
sample dotplot of four concatenated Wall Street Journal
articles is shown in figure 1. The real boundaries
between documents are located at word positions 1085,
2206 and 2863.
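
The three filtering steps can be illustrated with the following sketch. It assumes the input has already been tagged and lemmatized by external tools and is supplied as (word, tag, lemma) triples. The set of Penn Treebank tags treated as closed-class and the two-entry common-word list are assumptions made for the example; the text above does not enumerate either.

```python
# Assumed set of closed-class Penn Treebank tags (not specified in the text).
CLOSED_CLASS_TAGS = {
    "CC", "DT", "EX", "IN", "MD", "PDT", "POS", "PRP", "PRP$",
    "RP", "TO", "WDT", "WP", "WP$", "WRB",
}

# Hypothetical list of common open-class lemmas; only BE and HAVE are
# named in the text, so the real list is presumably somewhat longer.
COMMON_LEMMAS = {"be", "have"}


def filter_tokens(tagged):
    """Keep (position, lemma) pairs for content-carrying words only.

    `tagged` is a sequence of (word, tag, lemma) triples. Original word
    positions are preserved, so removed words leave gaps in the sequence.
    """
    kept = []
    for position, (_word, tag, lemma) in enumerate(tagged):
        if tag in CLOSED_CLASS_TAGS:
            continue  # step 1: drop closed-class words by tag
        lemma = lemma.lower()  # step 2: lemma supplied by the caller
        if lemma in COMMON_LEMMAS:
            continue  # step 3: drop ubiquitous open-class words
        kept.append((position, lemma))
    return kept


sentence = [
    ("The", "DT", "the"),
    ("markets", "NNS", "market"),
    ("were", "VBD", "be"),
    ("falling", "VBG", "fall"),
]
content = filter_tokens(sentence)
```

Keeping the original positions means that filtered words show up as gaps along the main diagonal of the dotplot.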

The word position in the file increases as values in- 
crease along both axes of the dotplot. As a result, the 
diagonal with slope equal to one is present since each 
word in the text is identical to itself. The gaps in this 
line correspond to points where words have been re- 
moved by one of the filters. Since the repetition of lexi- 
cal items occurs more frequently within regions of a text 
which are about the same topic or group of topics, the 
visually apparent squares along the main diagonal of 
the plot correspond to regions of the text. Regions are 
delimited by squares because of the symmetry present 
in the dotplot. 

Although boundaries may be identified visually using 
the dotplot, the plot itself is unnecessary for the dis- 
covery of boundaries. The reason the regions along the 
diagonal are striking to the eye is that they are denser. 
This fact leads naturally to an algorithm based on max- 
imizing the density of the regions within squares along 
the diagonal, which in turn corresponds to minimizing 
the density of the regions not contained within these 
squares. Once the densities of areas outside these re- 
gions have been computed, the algorithm begins by se- 
lecting the boundary which results in the lowest outside 
density. Additional boundaries are added until either 
the outside density increases or a particular number 
of boundaries have been added. Potential boundaries 
are selected from a list of either sentence boundaries or 
paragraph boundaries, depending on the experiment. 

More formally, let n be the length of the concatenation
of articles; let m be the number of unique tokens
(after lemmatization and removal of words on the stop
list); let B be a list of boundaries, initialized to contain
only the boundary corresponding to the beginning of
the series of articles, 0. Maintain B in ascending order.
Let i be a potential boundary; let P = B ∪ {i}, also
sorted in ascending order; let V_{x,y} be a vector containing
the word counts associated with word positions x
through y in the concatenation. Now, find the i such
that the equation below is minimized. Repeat this minimization,
inserting i into B, until the desired number
of boundaries have been located.

f(P) = \sum_{j=2}^{|P|} \frac{V_{P_{j-1},P_j} \cdot V_{P_j,n}}{(P_j - P_{j-1})(n - P_j)}

The dot product in the equation reveals the similarity
between this method and Hearst and Plaunt's (1993)
work, which was done in a vector-space framework. The
crucial difference lies in the global nature of this equa- 
tion. Their algorithm placed boundaries by comparing 
neighboring regions only, while this technique compares 
each region with all other regions. 
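
The greedy minimization described above can be sketched as follows. This is an illustrative reading of the procedure, not the original implementation: the function names are hypothetical, word-count vectors are represented as Counter objects, only the fixed-number-of-boundaries stopping rule is implemented, and each candidate is rescored from scratch rather than incrementally.

```python
from collections import Counter


def dot(u, v):
    """Dot product of two sparse word-count vectors."""
    return sum(count * v[word] for word, count in u.items())


def outside_density(tokens, boundaries):
    """Density of the dotplot area lying outside the diagonal squares.

    `boundaries` must contain 0; each region between consecutive
    boundaries is compared with all of the text that follows it.
    """
    n = len(tokens)
    p = sorted(boundaries)
    total = 0.0
    for j in range(1, len(p)):
        region = Counter(tokens[p[j - 1]:p[j]])
        rest = Counter(tokens[p[j]:n])
        area = (p[j] - p[j - 1]) * (n - p[j])
        if area > 0:
            total += dot(region, rest) / area
    return total


def segment(tokens, candidates, num_boundaries):
    """Greedily add the candidate boundary that minimizes outside density."""
    chosen = [0]
    remaining = set(candidates) - {0, len(tokens)}
    for _ in range(num_boundaries):
        if not remaining:
            break
        best = min(
            sorted(remaining),
            key=lambda i: outside_density(tokens, chosen + [i]),
        )
        chosen.append(best)
        remaining.discard(best)
    return sorted(chosen)[1:]


toy = ["a", "b", "a", "b", "a", "b", "c", "d", "c", "d", "c", "d"]
found = segment(toy, candidates=[3, 6, 9], num_boundaries=1)
```

In the toy input the two halves share no vocabulary, so placing the boundary at position 6 gives an outside density of zero, whereas positions 3 and 9 each leave repeated words on both sides of the boundary.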

A graph depicting the density of the regions not en- 
closed in squares along the diagonal is shown in figure 
2. The y-coordinate on this graph represents the den- 
sity when a boundary is placed at the corresponding 
location on the x-axis. These data are derived from 
the dotplot shown in figure 1. Actual boundaries corre- 
spond to the most extreme minima — those at positions 
1085, 2206 and 2863. 

Results 

Since determining where topic boundaries belong is a
subjective task (Passonneau and Litman, 1993), the preliminary
experiments conducted using this algorithm
involved discovering boundaries between concatenated
articles. All of the articles were from the Wall Street
Journal and were tagged in conjunction with the Penn
Treebank project, which is described in (Marcus et al.,
1993). The motivation behind this experiment is that
newspaper articles are about sufficiently different top- 
ics that discerning the boundaries between them should 
serve as a baseline measure of the algorithm's effective- 
ness. 





                            Expt. 1    Expt. 2
# of exact matches              271        106
# of close matches              196         55
# of extra boundaries          1085         32
# of missed boundaries           43        349
Precision                     0.175      0.549
Precision counting close      0.300      0.803
Recall                        0.531      0.208
Recall counting close         0.916      0.304

Table 1: Results of two experiments.



The results of two experiments in which between two 
and eight randomly selected Wall Street Journal arti- 
cles were concatenated are shown in table 1. Both ex- 
periments were performed on the same data set which 
consisted of 150 concatenations of articles containing a 
total of 660 articles averaging 24.5 sentences in length. 
The average sentence length was 24.5 words. The differ- 
ence between the two experiments was that in the first 
experiment, boundaries were placed only at the ends of 
sentences, while in the second experiment, they were 
only placed at paragraph boundaries. Tuning the stop- 
ping criteria parameters in either method allows im- 
provements in precision to be traded for declines in re- 
call and vice versa. The first experiment demonstrates 
that high recall rates can be achieved and the second 
shows that high precision can also be achieved. 

In these tests, a minimum separation between boundaries
was imposed to prevent documents from being
repeatedly subdivided around the location of one actual
boundary. For the purposes of evaluation, an exact
match is one in which the algorithm placed a boundary
at the same position as an actual boundary in the collection
of articles. A missed boundary is an actual boundary for
which the algorithm found no corresponding boundary. If a boundary
was not an exact match, but was within three sentences 
of the correct location, the result was considered a close 
match. Precision and recall scores were computed both 
including and excluding the number of close matches. 
The precision and recall scores including close matches 
reflect the admission of only one close match per ac- 
tual boundary. It should be noted that some of the 
extra boundaries found may correspond to actual shifts 
in topic and may not be superfluous. 
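
The evaluation scheme described above can be sketched as follows. This is an interpretation rather than the scoring code used for table 1: the function name and the returned dictionary are hypothetical, boundaries are assumed to be given as sentence indices, precision is assumed to be taken over all proposed boundaries, and each actual boundary is credited with at most one close match.

```python
def evaluate(predicted, actual, tolerance=3):
    """Score proposed boundaries against actual ones.

    A proposed boundary is an exact match if it coincides with an actual
    boundary, and a close match if it lies within `tolerance` sentences of
    an actual boundary that has not been matched already.
    """
    predicted = sorted(set(predicted))
    actual = sorted(set(actual))
    actual_set = set(actual)

    exact = [p for p in predicted if p in actual_set]
    matched = set(exact)  # actual boundaries already accounted for
    close = 0
    for p in predicted:
        if p in actual_set:
            continue
        nearby = [a for a in actual
                  if a not in matched and abs(a - p) <= tolerance]
        if nearby:
            matched.add(min(nearby, key=lambda a: abs(a - p)))
            close += 1

    n_pred, n_act = len(predicted), len(actual)
    return {
        "exact": len(exact),
        "close": close,
        "extra": n_pred - len(exact) - close,
        "missed": n_act - len(matched),
        "precision": len(exact) / n_pred if n_pred else 0.0,
        "precision_close": (len(exact) + close) / n_pred if n_pred else 0.0,
        "recall": len(exact) / n_act if n_act else 0.0,
        "recall_close": len(matched) / n_act if n_act else 0.0,
    }


scores = evaluate(predicted=[10, 22, 40, 55], actual=[10, 20, 50])
```

In this toy case, the boundary at 10 is exact, the one at 22 is a close match for 20, the ones at 40 and 55 are extra, and the actual boundary at 50 is missed.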

Future Work 

The current implementation of the algorithm relies on 
part of speech information to detect closed class words 
and to find sentence boundaries. However, a larger 
common word list and a sentence boundary recognition 
algorithm could be employed to obviate the need for 
tags. Then the method could be easily applied to large 
amounts of text. Also, since the task of segmenting 
concatenated documents is quite artificial, the approach 
should be applied to finding topic boundaries. To this 
end, the algorithm's output should be compared to the
segmentations produced by human judges and the section
divisions authors insert into some forms of writing,
such as technical writing. Additionally, the segment in- 
formation produced by the algorithm should be used 
in an information retrieval task as was done in (Hearst 
and Plaunt, 1993). Lastly, since this paper only exam- 
ined flat segmentations, work needs to be done to see 
whether useful hierarchical segmentations can be pro- 
duced. 